Introduction

The aim of this report is to develop a predictive model, based on past Airbnb listings in Amsterdam, to help new Airbnb users set competitive prices for their listings. To do so, three different models are considered. These are lasso regularized linear regression, k-nearest-neighbors, and random forest model. The best model is selected based on the predictive performance on the test set. For this purpose, the dataset is split into a training and testing set. Furthermore, each of these models requires model tuning, which is completed using 10-k cross-validation on the training set.

In the following sections, the data cleaning and feature selection process is explained, after which all the three models are tuned and tested. After this, the performance of each model is assessed, and the best model is selected. Finally, the last section discusses the recommendations based on the findings.

Setup notebook

The following libraries are used for this notebook:

# Load libraries
library(tidyverse)
library(tidymodels)
library(glmnet)
library(leaps)
library(naniar)
library(skimr)
library(knitr)
library(corrplot)
library(ranger)
library(doParallel)
library(themis)
library(vip)

Load data listings

# Read csv with listing information
data <- read_csv(gzfile("listings.csv.gz"))

Part 1: Data Cleaning and Feature Selection

Select variables of interest

An important implication for variable selection is to avoid data leakages. You have data leakage when a feature is used for in the predictive models, that at the time of the prediction cannot be available. All variables including information on the reviews provide information about a listing after it has been published. Including this variables could be a potential source of data leakage. Therefore, these variables are not included in the data set. Moreover, the variables including a description and summary about the listing can be analyzed using NLP (e.g. sentiment analysis). However, this is beyond the scope of the assignment. Therefore, these variables are excluded from the model. The variables below are included in the data set for further analysis and cleaning.

# Generate subset with variables of interest
data_sub <- data %>%
  select(id, price, property_type, room_type, accommodates, bathrooms, bedrooms,
         beds, bed_type, amenities, host_since, host_response_time,
         host_response_rate, host_neighbourhood, host_listings_count, 
         host_verifications, host_identity_verified, neighbourhood_cleansed,
         square_feet, cleaning_fee, guests_included, extra_people, 
         minimum_nights, maximum_nights, availability_30, availability_60,
         availability_90, availability_365, instant_bookable, 
         cancellation_policy, require_guest_profile_picture,
         require_guest_phone_verification, calculated_host_listings_count,
         calculated_host_listings_count_entire_homes, 
         calculated_host_listings_count_private_rooms,
         calculated_host_listings_count_shared_rooms)
# Inspect data
head(data_sub)
# Inspect data
skim(data_sub) %>% knit_print()
Name data_sub
Number of rows 20025
Number of columns 36
_______________________
Column type frequency:
character 13
Date 1
logical 4
numeric 18
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
price 0 1.00 5 9 0 479 0
property_type 0 1.00 3 22 0 34 0
room_type 0 1.00 10 15 0 4 0
bed_type 0 1.00 5 13 0 5 0
amenities 0 1.00 2 1303 0 19213 0
host_response_time 158 0.99 3 18 0 5 0
host_response_rate 158 0.99 2 4 0 61 0
host_neighbourhood 5972 0.70 4 35 0 81 0
host_verifications 0 1.00 2 158 0 381 0
neighbourhood_cleansed 0 1.00 4 38 0 22 0
cleaning_fee 3604 0.82 5 7 0 112 0
extra_people 0 1.00 5 7 0 112 0
cancellation_policy 0 1.00 8 27 0 5 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
host_since 158 0.99 2008-09-24 2019-12-06 2015-02-08 3133

Variable type: logical

skim_variable n_missing complete_rate mean count
host_identity_verified 158 0.99 0.39 FAL: 12094, TRU: 7773
instant_bookable 0 1.00 0.26 FAL: 14839, TRU: 5186
require_guest_profile_picture 0 1.00 0.01 FAL: 19819, TRU: 206
require_guest_phone_verification 0 1.00 0.01 FAL: 19758, TRU: 267

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1.00 19117026.00 11473019.80 2818 9631819 18418621 27913527.0 40655209 ▇▇▇▅▆
accommodates 0 1.00 2.87 1.30 1 2 2 4.0 18 ▇▁▁▁▁
bathrooms 6 1.00 1.18 0.39 0 1 1 1.5 8 ▇▁▁▁▁
bedrooms 13 1.00 1.45 0.89 0 1 1 2.0 12 ▇▁▁▁▁
beds 31 1.00 1.79 1.41 0 1 1 2.0 32 ▇▁▁▁▁
host_listings_count 158 0.99 3.85 28.93 0 1 1 1.0 751 ▇▁▁▁▁
square_feet 19662 0.02 542.88 560.24 0 0 484 834.0 3229 ▇▅▁▁▁
guests_included 0 1.00 1.46 0.95 1 1 1 2.0 16 ▇▁▁▁▁
minimum_nights 0 1.00 3.43 14.74 1 2 2 3.0 1001 ▇▁▁▁▁
maximum_nights 0 1.00 613.89 548.62 1 20 1125 1125.0 11250 ▇▁▁▁▁
availability_30 0 1.00 4.46 7.81 0 0 0 6.0 30 ▇▁▁▁▁
availability_60 0 1.00 10.05 17.14 0 0 0 13.0 60 ▇▁▁▁▁
availability_90 0 1.00 15.69 26.77 0 0 0 19.0 90 ▇▁▁▁▁
availability_365 0 1.00 47.93 95.34 0 0 0 37.0 365 ▇▁▁▁▁
calculated_host_listings_count 0 1.00 1.97 5.20 1 1 1 1.0 72 ▇▁▁▁▁
calculated_host_listings_count_entire_homes 0 1.00 1.50 5.06 0 1 1 1.0 72 ▇▁▁▁▁
calculated_host_listings_count_private_rooms 0 1.00 0.38 1.08 0 0 0 0.0 16 ▇▁▁▁▁
calculated_host_listings_count_shared_rooms 0 1.00 0.01 0.09 0 0 0 0.0 3 ▇▁▁▁▁

Data cleaning

Basic cleaning

First, we converted all categorical and logical variables, which did not need any further cleaning, to data type factor.

# Convert categorical vars to factors 
data_sub$property_type <- factor(data_sub$property_type , 
                                 levels = unique(data_sub$property_type))
data_sub$room_type <- factor(data_sub$room_type , 
                             levels = unique(data_sub$room_type))
data_sub$bed_type <- factor(data_sub$bed_type , 
                    levels = unique(data_sub$bed_type))
data_sub$host_response_time <- factor(data_sub$ host_response_time, 
                    levels = unique(data_sub$host_response_time))
data_sub$host_neighbourhood <- factor(data_sub$host_neighbourhood, 
                    levels = unique(data_sub$host_neighbourhood))
data_sub$neighbourhood_cleansed <- factor(data_sub$neighbourhood_cleansed, 
                    levels = unique(data_sub$neighbourhood_cleansed))
data_sub$cancellation_policy <- factor(data_sub$cancellation_policy , 
                    levels = unique(data_sub$cancellation_policy))

# Convert logical variables to factors
data_sub$host_identity_verified <- factor(data_sub$host_identity_verified)
data_sub$instant_bookable <- factor(data_sub$instant_bookable)
data_sub$require_guest_profile_picture <- 
  factor(data_sub$require_guest_profile_picture)
data_sub$require_guest_phone_verification <- 
  factor(data_sub$require_guest_phone_verification)

Second, some of the variables contain numeric variables, however, they are stored in a string containing a dollar or percentage sign. The signs were removed from the strings, and the remaining numbers were converted to numeric variables.

# Remove $ sign from columns containing prices and convert to doubles
data_sub$price <- as.double(gsub("[,$]", "", data_sub$price))
data_sub$cleaning_fee <- as.double(gsub("[,$]", "", data_sub$cleaning_fee))
data_sub$extra_people <- as.double(gsub("[,$]", "", data_sub$extra_people))

# Replace "N/A" values, remove % and convert to percentage 
data_sub$host_response_rate <- na_if(data_sub$host_response_rate, "N/A")
data_sub$host_response_rate <- 
  as.double(gsub("[%]", "", data_sub$host_response_rate)) / 100

Clean amenities

The variable amenities contains all the amenities of the listing. However, this was stored in one large string, and the elements could not be accessed separately. Therefore, we cleaned the string and splitted it to a list containing the separated elements. Furthermore, we extracted all unique amenities and stored these in a vector.

# Plot correlation matrix
data_sub %>% select_if(is.numeric) %>% 
  select(availability_30, availability_60, availability_90, availability_365) %>% 
  cor() %>% corrplot()


# Remove other availability variables from data set
data_sub <- 
  data_sub %>% 
  select(-availability_60, -availability_90, -availability_365)

Create new variables based information stored in amenities

Separate variables could be created from the vector containing all unique amenities. However, since there are 130 usable amenities it seemed beyond the scope of the assignment. Moreover, some of the amenities contain information that is also given by other variables or about other amenities. For example a private bathroom could also be a strong indicator that the listing is an entire house/apartment. Also, shampoo or shower gel could be strong indicators for the presence of a bathroom. Furthermore, some amenities are very specific and apply only to a few or one house. Yet, variables are created for some of the amenities that could have an influence on the price of a listing. We created variables for wifi, pool, hot_tub and tv.

# Create variable for WIFI and add to data set
wifi <- vector()
for(i in 1:length(data_sub$amenities_clean)) {
  if("wifi" %in% data_sub$amenities_clean[[i]] | 
     "internet" %in% data_sub$amenities_clean[[i]]) {
    wifi[i] <-  "yes"
  } else {
    wifi[i] <-  "no"
  }
}
data_sub$wifi <- wifi
data_sub$wifi <- factor(data_sub$wifi, levels = c("yes", "no"))

# Create variable for pool and add to data set
pool <- vector()
for(i in 1:length(data_sub$amenities_clean)) {
  if("pool" %in% data_sub$amenities_clean[[i]] |
     "pool with pool hoist" %in% data_sub$amenities_clean[[i]]) {
    pool[i] <-  "yes"
  } else {
    pool[i] <-  "no"
  }
}
data_sub$pool <- pool
data_sub$pool <- factor(data_sub$pool, levels = c("yes", "no"))

# Create variable for hot_tub and add to data set
hot_tub <- vector()
for(i in 1:length(data_sub$amenities_clean)) {
  if("hot tub" %in% data_sub$amenities_clean[[i]]) {
    hot_tub[i] <-  "yes"
  } else {
    hot_tub[i] <-  "no"
  }
}
data_sub$hot_tub <- hot_tub
data_sub$hot_tub <- factor(data_sub$hot_tub, levels = c("yes", "no"))

# Create variable for hot_tub and add to data set
tv <- vector()
for(i in 1:length(data_sub$amenities_clean)) {
  if("tv" %in% data_sub$amenities_clean[[i]] |
     "cable tv" %in% data_sub$amenities_clean[[i]]) {
    tv[i] <-  "yes"
  } else {
    tv[i] <-  "no"
  }
}
data_sub$tv <- tv
data_sub$tv <- factor(data_sub$tv, levels = c("yes", "no"))

Clean host verification methods

The same reasoning from the amenities section above applies to the variable host_verification. The same cleaning method was used as for amenities, first we cleaned and splitted the strings, thereafter a vector with all unique host verification methods is generated.

# Clean and split strings for host verification methods
# Returns a list with all unique values
clean_verficiations <- function(x) {
  subbed <- gsub("\\[|\\]", "", tolower(x))
  subbed_complete <- gsub("[']", "", subbed)
  splitted <- str_split(subbed_complete, ",")
  clean <- sapply(splitted, function(x) str_trim(x, side = "both"))
  return(clean)
}

# Clean host_verifications
data_sub$host_verifications_clean <- 
  sapply(data_sub$host_verifications, function(x) clean_verficiations(x))

# Generate list with all unique host verification methods
verifications_unique = c()
for(verifications in data_sub$host_verifications_clean) {
  for(element in verifications) {
    if(!(element %in% verifications_unique) & element != "") {
      verifications_unique <- append(verifications_unique, str_trim(element))
    }
  }
}
verifications_unique
 [1] "email"                 "phone"                 "reviews"              
 [4] "jumio"                 "offline_government_id" "selfie"               
 [7] "government_id"         "identity_manual"       "facebook"             
[10] "work_email"            "none"                  "google"               
[13] "manual_offline"        "manual_online"         "sent_id"              
[16] "kba"                   "weibo"                 "zhima_selfie"         
[19] "sesame"                "sesame_offline"       

Create variables based on information stored in host verifications

Separate variables are created for the most common methods of verification, namely email, phone, facebook and government_id.

# Create variable for host email and add to data set
host_email <- vector()
for(i in 1:length(data_sub$host_verifications_clean)) {
  if("email" %in% data_sub$host_verifications_clean[[i]]) {
    host_email[i] <-  "yes"
  } else {
    host_email[i] <-  "no"
  }
}
data_sub$host_email <- host_email 
data_sub$host_email <- factor(data_sub$host_email, levels = c("yes", "no"))

# Create variable for phone and add to data set 
host_phone <- vector()
for(i in 1:length(data_sub$host_verifications_clean)) {
  if("phone" %in% data_sub$host_verifications_clean[[i]]) {
    host_phone[i] <-  "yes"
  } else {
    host_phone[i] <-  "no"
  }
}
data_sub$host_phone <- host_phone 
data_sub$host_phone <- factor(data_sub$host_phone, levels = c("yes", "no"))
 
# Create variable for host facebook and add to data set
host_facebook <- vector()
for(i in 1:length(data_sub$host_verifications_clean)) {
  if("facebook" %in% data_sub$host_verifications_clean[[i]]) {
    host_facebook[i] <-  "yes"
  } else {
    host_facebook[i] <-  "no"
  }
}
data_sub$host_facebook <- host_facebook 
data_sub$host_facebook <- 
  factor(data_sub$host_facebook, levels = c("yes", "no"))

# Create variable for government id 
host_government_id <- vector()
for(i in 1:length(data_sub$host_verifications_clean)) {
  if("government_id" %in% data_sub$host_verifications_clean[[i]]) {
    host_government_id[i] <-  "yes"
  } else {
    host_government_id[i] <-  "no"
  }
}
data_sub$host_government_id <- host_government_id 
data_sub$host_government_id <- 
  factor(data_sub$host_government_id, levels = c("yes", "no"))

Clean date variables

We used the variable host_since to create a new variable host_years_active, which contains information regarding the number of years a host has been active on the platform.

# Create new variable for active years host 
data_sub <- data_sub %>% 
  mutate(host_years_active = 
           as.double(as.Date("2019-12-07") - host_since) / 365)

Inspect availability variables

The variables availability_30, availability_60, availability_90 and availability_365 carry some of the same information. In order to inspect if all variables should be included in the data set, we plotted a correlation matrix. The plot below shows that all variables indicating the availability are strongly correlated. Therefore, only the variable availability_30 is included for further analysis.

# Create a train-split sets
seed_x <-  123
set.seed(seed_x)
data_split <- initial_split(data_final, prop = 0.7)
data_train <- training(data_split)
data_test <- testing(data_split)

Check missing data

In order to inspect which variables have missing cases, and how many there are in general, a table is constructed. The table shows the variables in the data set which contain missing values (in descending order). The table shows that the variable square_feet has \(19662\) missing cases, which is about \(98.19\%\). If we would deleted the missing cases, the data set will barely contain any data. Moreover, other methods for handling missing values, like replacing NA-values with the mean or median, would not be appropriate as the variables will be based on only \(1.81\%\) of the data. Therefore, square_feet is not included in the final. The variable host_response_rate has \(9349\) missing cases, which is about \(46.69\%\). The variable host_neighbourhood has \(5972\), which is about \(29.82\%\). The variable cleaning_fee has \(3604\), which is about \(18.00\%\). The variables host_response_rate, host_neighbourhood and cleaning_fee do not have as many missing values as square_feet, however, the same reasoning applies. Due to these findings, these variables are also excluded from the analyses.

# Count missing cases per variable
na_counter <- sapply(data_sub, function(x) sum(is.na(x)))
vars <- colnames(data_sub)

# Extract all variables with NA-values
na_values <- tibble(variables = vars, na_count = na_counter) 

# Check na count per variable
na_values %>%
  filter(na_count > 0) %>%
  arrange(desc(na_count))

Moreover, a check is performed on the number of cases that contain missing values if the other values that have missing cases would be included in the ‘final’ data set. The table below shows that there are 206 cases which contain missing values, which is about \(1.03\%\) of the entire data set. Since this is a very small proportion it is not very likely that deleting these cases would have a large impact on the predictions. Furthermore, all models that will be performed cannot handle missing values. Therefore, the cases with missing values are deleted.

# Compute incomplete rows
data_sub %>% 
  select(host_since, host_response_time, host_listings_count,
         host_identity_verified, beds, bedrooms, host_years_active, 
         bathrooms) %>% 
  complete.cases() %>% 
  summary(count())
   Mode   FALSE    TRUE 
logical     206   19819 
# Select variables for data set
variables_analysis <-   
  na_values %>% 
  filter(na_count <= 158) %>% 
  select(variables) %>% 
  pull(variables)

# Create final data set
data_semi_final <- data_sub %>% select(all_of(variables_analysis))
data_semi_final <- data_semi_final %>% 
  select(-c(amenities, amenities_clean, host_verifications,
            host_verifications_clean, host_since))
data_final <- data_semi_final[complete.cases(data_semi_final), ]

Here is a table describing the features included in the final data set:

Variable Name Variable Description
id Listing ID
price Log transformed price listing (€)
property_type Type of property that is rented out
room_type Type of room that is rented out
accommodates Number of people that can be accommodated in the rented property
bathrooms Number of bathrooms in the property
bedrooms Number of bedrooms in the property
beds Number of beds in the property
bed_type Type of bed present within the property
host_listings_count Number of total listings of a host
host_response_time Average time it takes for a host to respond
host_identity_verified Whether a host verified his identify on Airbnb
neighbourhood_cleansed Neighborhood location of the listed property
guests_included Number of guests included within a listing
extra_people Number of extra people allowed at the listing
minimum_nights The minimum amount of nights that a guest has to book regarding this listing
maximum_nights The maximum amount of nights that a guest is allowed to book regarding this listing
availability_30 The amount of days for which a particular host is available in 30 days
instant_bookable Whether guests can directly book without a inquiry
cancellation_policy The set policy regarding guest cancellation
require_guest_profile_picture Whether the potential guest needs a profile picture in order to book the listing
require_guest_phone_verification Whether the potential guest needs a phone verification in order to book the listing
calculated_host_listings_count The number of total host listings
calculated_host_listings_count_entire_homes The number of total entire home listings of a specific host
calculated_host_listings_count_private_rooms The number of total private room listings of a specific host
calculated_host_listings_count_shared_rooms The number of total shared room listings of a specific host
wifi Whether WiFi is present within the property
pool Whether a pool is present within the property
hot_tub Whether a hot tub is present within the property
tv Whether a tv is present within the property
host_email Whether the host verified his email on Airbnb
host_phone Whether the host verified his phone on Airbnb
host_facebook Whether the host verified his Facebook account on Airbnb
host_government_id Whether the host verified his government ID on Airbnb
host_years_active Number of years the host is active on Airbnb

Create train-test split

For further analysis the data is split into a train-test set.

# Plot distribution price
ggplot(data = data_train , aes(price)) +
  geom_histogram(col="black",
                 breaks=seq(0, max(data_train$price), by=75),
                 aes(fill=..count..)) +
  labs(title="Distribution for Price", x="Price", y="Count") +
  scale_fill_gradient("Count", low="green", high="red") +
  theme(plot.title = element_text(hjust = 0.5))

Inspecting the predicted variable

In order to prevent data leakages we only inspect the predicted variable price in the training set. In order to inspect price we have created a distribution plot. The plot shows that the data set contains some outliers and that the distribution is rightly skewed. Both the outliers and the skeweness make the data less interpretable and this could have an influence on performance of the models.

# Plot distribution price
ggplot(data = data_train , aes(log(price + 1))) +
  geom_histogram(col="black",
                 aes(fill=..count..)) +
  labs(title="Distribution for Ln Price", x="Ln Price", y="Count") +
  scale_fill_gradient("Count", low="green", high="red") +
  theme(plot.title = element_text(hjust = 0.5))

In order to prevent these potential problems we created a new distribution plot with a log transformed variable price. The plot shows that the distribution is less skewed and does not contain many large outliers. Consequently, the data is more interpretable. Therefore, we will use the log transformed price for our models.

# Log transform the price in for both training and test set
data_final$price <- log(data_final$price + 1)

# Resplit the data using the same seed 
set.seed(seed_x)
data_split <- initial_split(data_final, prop = 0.7)
data_train <- training(data_split)
data_test <- testing(data_split)
# Generate 10-fold CV sets
set.seed(321)
data_folds <- vfold_cv(data_train, v = 10)
data_folds
#  10-fold cross-validation 

K-fold cross validation

Moreover, we have generated 10-fold cross validation sets.

rm(data)
rm(data_sub)
rm(data_semi_final)
rm(na_values)

Remove data frames to avoid leakages and errors

#specify the model and engine used
lasso_linreg <- linear_reg(penalty = tune(), mixture = 1) %>% 
  set_engine("glmnet")

#check that model specified correctly:
lasso_linreg %>% translate()
Linear Regression Model Specification (regression)

Main Arguments:
  penalty = tune()
  mixture = 1

Computational engine: glmnet 

Model fit template:
glmnet::glmnet(x = missing_arg(), y = missing_arg(), weights = missing_arg(), 
    alpha = 1, family = "gaussian")

Part 2: Modelling Approaches

Linear Lasso Regularized Regression Model

In this section, a regularized regression model will be specified and trained. A lasso penalty is chosen to simultaneously perform subset selection. Therefore, a mixture is set to 1 in the model specification.

Model specification

Specification of lasso-regularized logistic regression model, where the penalty parameter will be tuned:

#prepare the recipe by setting up the regression model, setting id as id variable, combining small categories to other class, and creating dummies and normalizing variables
lasso_recipe <-  recipe(price ~ ., 
                          data = data_train) %>% 
                        update_role(id, new_role = "ID") %>%
                        step_other(property_type, bed_type,  threshold = 0.01, other = "other values") %>% 
                        step_dummy(all_nominal(), -all_outcomes()) %>%
                        step_normalize(all_predictors(), -all_outcomes())
lasso_recipe

Preprocessing recipe

In this section, the recipe is formulated. All the final dataset variables are included in the recipe, in order to perform the subset selection by means of a lasso penalty. As the property type and bed type have some categories with just a few observations, the categories that include less than 1% of the total number of observations are combined to “other” category to avoid sparse data. Additionally, dummies are created for all of the nominal variables. Lastly, all the variables are normalized.

#prepare and bake the data (on training set) to check that the recipe prepares the data correctly
data_baked <- lasso_recipe %>% prep(data_train) %>% bake(data_train)
head(data_baked)

Testing that this works properly:

#combine the model specification and reciple to a workflow
lasso_wf <- workflow() %>% 
  add_recipe(lasso_recipe) %>% 
  add_model(lasso_linreg)
lasso_wf
== Workflow ====================================================================
Preprocessor: Recipe
Model: linear_reg()

-- Preprocessor ----------------------------------------------------------------
3 Recipe Steps

* step_other()
* step_dummy()
* step_normalize()

-- Model -----------------------------------------------------------------------
Linear Regression Model Specification (regression)

Main Arguments:
  penalty = tune()
  mixture = 1

Computational engine: glmnet 

Create Lasso Workflow

#set tuning grid
grid_lasso <- tibble(penalty = 10^(seq(from = -5, to = 1, length.out = 70)))

Tuning grids

Next, the \(\lambda\) parameter of the lasso model will be tuned. For that purpose, a tuning grid is specified.

 # perform grid search over the tuning grid of penalty values
 lasso_tune <- lasso_wf %>% 
  tune_grid(resamples = data_folds, 
            grid = grid_lasso,
            metrics = metric_set(mae, rmse, rsq_trad))

Tuning lasso-penalized linear regression

10-k-cross-validation is used to tune the lasso-penalized linear regression, and the metrics are plotted against the different values of \(\lambda\).

#save metrics in an object
lasso_tune_metrics <- lasso_tune %>% 
  collect_metrics()

# Plot all results metrics 
lasso_tune_metrics %>%
  ggplot(aes(x = penalty, y = mean, 
             ymin = mean - std_err, ymax = mean + std_err)) + 
  geom_linerange(alpha = 0.5, colour = "red") + 
  geom_point(colour = "red") + 
   facet_wrap(~ .metric, scale = "free_y") +
  scale_x_log10() + 
  labs(y = "Lasso Performance Metrics", x = expression(lambda))


# Plot MAE 
lasso_tune_metrics %>% filter(.metric == "mae") %>% 
  ggplot(aes(x = penalty, y = mean, 
             ymin = mean - std_err, ymax = mean + std_err)) + 
  geom_linerange(alpha = 0.5, colour = "red") + 
  geom_point(colour = "red") + 
  scale_x_log10() + 
  labs(y = "mae", x = expression(lambda), 
       title = "Lasso Regresssion MAE")

 #show best models with corresponding penalty values
lasso_tune %>% show_best("mae")

Next, the Lambda value which results in best model performance on the train set is selected. It can be seen that as the RMSE is more sensitive for large residuals, the standard errors of these metrics are larger compared to the standard errors of mean absolute error (MAE). Therefore, the mean absolute error is used to select the best model.

 #select best model according to 1 std error rule
lasso_1se_model <- select_by_one_std_err(lasso_tune, metric = "mae", desc(penalty))
lasso_1se_model

The best model is selected using the one standard error rule, where the simplest model that has MAE inside one standard error from the absolute best model is chosen to avoid overfitting.

#finalize lasso wf with the selected best model
lasso_wf_tuned <- 
  lasso_wf %>% 
  finalize_workflow(lasso_1se_model)
lasso_wf_tuned
== Workflow ====================================================================
Preprocessor: Recipe
Model: linear_reg()

-- Preprocessor ----------------------------------------------------------------
3 Recipe Steps

* step_other()
* step_dummy()
* step_normalize()

-- Model -----------------------------------------------------------------------
Linear Regression Model Specification (regression)

Main Arguments:
  penalty = 0.00740568469226243
  mixture = 1

Computational engine: glmnet 

As can be seen, the best model has a penalty parameter of 0.007.

Finalize the workflow:

#train the tuned model on all of the train data and test on the test data 
lasso_last_fit <- lasso_wf_tuned %>% 
  last_fit(data_split, metrics = metric_set(mae, rmse, rsq_trad))
#collect metrics from the model on the test set
lasso_test_metrics <- lasso_last_fit %>% collect_metrics()
lasso_test_metrics

The performance on the test set for this model is:

#fit the model on the training data and pull the model coefficients for the variables
lasso_wf_tuned %>% fit(data_train) %>% pull_workflow_fit() %>% tidy() 

As seen above, the final lasso model has mean absolute error of 0.27, root mean squared error of 0.37 and R squared on 48,7% on the test data.

To assess the importance of the predictor variables, model parameter estimates are calculated below:

# Tune specification
rf_tune_spec <- rand_forest(mtry = tune(), trees = 200) %>%
  set_engine("ranger") %>%
  set_mode("regression")

As lasso performs subset selection automatically, some variables have a coefficient of zero. There are multiple variables with coefficient of zero, which implies that these variables are less important for the price prediction of new Airbnb listings. The most important variables can be identified by looking at the coefficients as well, and the 4 most important variables are number of accommodates, the number of days that the airbnb is available inside 30 days, room type of entire home apartment, and lastly, Centrum-West neighbourhood.

Random Forest

Random forest specification

Recipe

Within this section, a random forest will be created. First a preprocessing recipe is created. The id variable is updated to a seperate role, instead of being a predictor.

# Specify recipe
rf_recipe <- recipe(price ~ ., data = data_train) %>%
  update_role(id, new_role = "id var")

rf_recipe
Data Recipe

Inputs:

Tune specifications

Within this section the tune specification are mentioned. The mtry is the number of features that are used at each split. The exact mtry value will be tuned later on. Different values for trees where tested (200, 500 & 1000). Increasing the amount of trees did not have much impact on the results. Therefore, a tree size of 200 is chosen to save computational time.

# Workflow creation
rf_tune_wf <- workflow() %>%
  add_recipe(rf_recipe) %>%
  add_model(rf_tune_spec)

The recipe and the model are combined into a workflow that can be tuned.

registerDoParallel()

A metric set is specified, which calculates the Root Mean Square Error (RMSE), the Mean Absolute Error (MAE) and the R-squared (rsq_trad) is created.

# Class metrics specification 
class_metrics <- metric_set(rmse, mae, rsq_trad)

The command bellow allows us to do computations in parallel.

# Define the tune grid 
rf_tunegrid <- tibble(mtry = c(1:10))

The command grid = tibble(mtry = 1:33) was utlised, as this allows the random forest to consider all number of variables at each split. Based on the MAE criteria, a mtry of 5, 6, 7, 8 & 9 was found as the optimal solution. Afterwards, a mtry of c(1:10)) is taken that will include the optimal values, as well mtry values of 1 up until 4 and a mtry of 10. This allows us to see that the mtry is initially increased up until it reaches its optimal mtry solution.

# Plot variable importance
var_importance_plot <-
  rf_vi_fit %>%
  pull_workflow_fit() %>% vip(geom = "point", num_features = 12) +
  labs(title = "Random Forest Variable Importance") +
  theme(plot.title = element_text(hjust = 0.5)) 
rf_vi_fit
== Workflow [trained] ==========================================================
Preprocessor: Recipe
Model: rand_forest()

-- Preprocessor ----------------------------------------------------------------
0 Recipe Steps

-- Model -----------------------------------------------------------------------
Ranger result

Call:
 ranger::ranger(formula = ..y ~ ., data = data, mtry = ~7, num.trees = ~200,      importance = ~"permutation", num.threads = 1, verbose = FALSE,      seed = sample.int(10^5, 1)) 

Type:                             Regression 
Number of trees:                  200 
Sample size:                      13874 
Number of independent variables:  33 
Mtry:                             7 
Target node size:                 5 
Variable importance mode:         permutation 
Splitrule:                        variance 
OOB prediction error (MSE):       0.1278328 
R squared (OOB):                  0.5472826 
# Save plot for presentation
ggsave("plots/rf_var_importance.png", plot = var_importance_plot,
       height = 7 , width = 10)
# Tune the grid
set.seed(12345)
rf_tune_res <- tune_grid(
  rf_tune_wf,
  resamples = data_folds,
  grid = rf_tunegrid,
  metrics = class_metrics
)
rf_tune_res
# Tuning results
# 10-fold cross-validation 

Selecting tuning parameters

# Collect metrics
rf_tune_res %>%
  collect_metrics()

This plot illustrates the best mtry, based on the criteria of the MAE. A lower MAE would indicate a better result, as a lower value indicates a lower error of prediction.

# Plot results all metrics
rf_tune_res %>%
  collect_metrics() %>%
  filter(.metric %in% c("rmse", "mae", "rsq_trad")) %>%
  ggplot(aes(x = mtry, y = mean, ymin = mean - std_err, ymax = mean + std_err, 
             colour = .metric)) +
  geom_errorbar() + 
  geom_line() +
  geom_point() +
  facet_grid(.metric ~ ., scales = "free_y") +
  labs(title = "Random Forest performance metrics")


# Plot the MAE based
rf_tune_res %>%
  collect_metrics() %>%
  filter(.metric == "mae") %>% 
  ggplot(aes(x = mtry, y = mean, ymin = mean - std_err, ymax = mean + std_err)) +
  geom_errorbar(colour = "red") + 
  geom_line(colour = "red") +
  geom_point(colour = "red") +
  labs(y = "mae", title = "Random Forest performance metrics")

# Generate tuning grid for knn
knn_tune_grid <- tibble(neighbors = 1:50*2-1)

This command will show the best mtry based on the MAE metrics.

# Find the mtry with the best MAE
rf_tune_res %>% show_best("mae")

Best model selection

The best model based on the MAE criteria is selected and eventually finalized into the workflow.

# Best model selection
best_rmse <- select_best(rf_tune_res, "mae")
final_rf <- finalize_workflow(rf_tune_wf, best_rmse)
final_rf
== Workflow ====================================================================
Preprocessor: Recipe
Model: rand_forest()

-- Preprocessor ----------------------------------------------------------------
0 Recipe Steps

-- Model -----------------------------------------------------------------------
Random Forest Model Specification (regression)

Main Arguments:
  mtry = 6
  trees = 200

Computational engine: ranger 

Test set performance

Now we can train the finalized workflow on our entire training set

# Finalise workflow on training set
final_res <- final_rf %>%
  last_fit(data_split, metrics = class_metrics)

The results based on the test set will be

# Score on test data
set.seed(54321)
final_res %>%
  collect_metrics()

Variable importance

Now we try to asses the variable importance. We will refit the model based on our previous tune parameters. We previousyly found an optimal mtry of 7, that’s why the mtry is specified as 7. However, do keep in mind that because of the random element within a random forest, that this initial value might alter. We noticed that the optimal mtry switches between 6, 7 & 8.

# Refit the model
rf_model_vi <- rand_forest(mtry = 7, trees = 200) %>%
  set_engine("ranger", importance = "permutation")

rf_vi_wf <- workflow() %>% 
  add_model(rf_model_vi) %>% 
  add_recipe(rf_recipe)

# Fit the model again
set.seed(12345)
rf_vi_fit <- rf_vi_wf %>% fit(data = data_train)

We can use the refitted model in order the gather the variable importance

# Variable importance 
rf_vi_fit %>% pull_workflow_fit() %>% vi()

The variable importance indicates that the accommodates, bedrooms and room_type are the most important variables for predicting the price. The variables which are the least important for predicting the price are bed_type, pool, and wifi. Pool and wifi actually have a negative importance, but as this is close to a value of 0, it is chosen to still include those variables.

# Specify model 
knn_mod <- 
  nearest_neighbor(neighbors = tune()) %>% 
  set_mode("regression") %>% 
  set_engine("kknn", scale=FALSE)
knn_mod
K-Nearest Neighbor Model Specification (regression)

Main Arguments:
  neighbors = tune()

Engine-Specific Arguments:
  scale = FALSE

Computational engine: kknn 
# Specify recipe
knn_recipe <- 
  recipe(price ~ ., data = data_train) %>% 
  step_rm(all_nominal()) %>%
  update_role(id, new_role = "id var") %>% 
  step_normalize(all_predictors(), -id)
knn_recipe
Data Recipe

Inputs:

Operations:

Delete terms all_nominal()
Centering and scaling for all_predictors(), -id

K-Nearest-Neighbors

Set up turning grid

# Check normalization 
train_baked <- knn_recipe %>% prep(data_train) %>% bake(data_train)
train_baked %>% head()
round(colMeans(train_baked, 8))
                                          id                                 accommodates 
                                    19106945                                            0 
                                   bathrooms                                     bedrooms 
                                           0                                            0 
                                        beds                          host_listings_count 
                                           0                                            0 
                             guests_included                                 extra_people 
                                           0                                            0 
                              minimum_nights                               maximum_nights 
                                           0                                            0 
                             availability_30               calculated_host_listings_count 
                                           0                                            0 
 calculated_host_listings_count_entire_homes calculated_host_listings_count_private_rooms 
                                           0                                            0 
 calculated_host_listings_count_shared_rooms                            host_years_active 
                                           0                                            0 
                                       price 
                                           5 
round(apply(train_baked, 2, sd), 8)
                                          id                                 accommodates 
                                1.141721e+07                                 1.000000e+00 
                                   bathrooms                                     bedrooms 
                                1.000000e+00                                 1.000000e+00 
                                        beds                          host_listings_count 
                                1.000000e+00                                 1.000000e+00 
                             guests_included                                 extra_people 
                                1.000000e+00                                 1.000000e+00 
                              minimum_nights                               maximum_nights 
                                1.000000e+00                                 1.000000e+00 
                             availability_30               calculated_host_listings_count 
                                1.000000e+00                                 1.000000e+00 
 calculated_host_listings_count_entire_homes calculated_host_listings_count_private_rooms 
                                1.000000e+00                                 1.000000e+00 
 calculated_host_listings_count_shared_rooms                            host_years_active 
                                1.000000e+00                                 1.000000e+00 
                                       price 
                                5.313830e-01 
rm(train_baked)

Specify a workflow

Something that should be noted for this recipe is that only numeric variables are included. This is done for the reason that categorical variables translate with difficulty to a k-nearest neighbor algorithm. The premise of prediction based on a KNN-model is that it relies exclusively on the distance between points in the data. This distance is obvious when handling numeric variables. However, when dealing with non-numeric values and variables this distance between data points cannot easily be modeled, provided they should be modeled at all. (This will have implications for determining predictions for importance and coefficients for variables, which will be addressed at the end of the section on the KNN-model).

# Specify workflow
knn_workflow <- 
  workflow() %>% 
  add_model(knn_mod) %>% 
  add_recipe(knn_recipe)
knn_workflow
== Workflow ====================================================================
Preprocessor: Recipe
Model: nearest_neighbor()

-- Preprocessor ----------------------------------------------------------------
2 Recipe Steps

* step_rm()
* step_normalize()

-- Model -----------------------------------------------------------------------
K-Nearest Neighbor Model Specification (regression)

Main Arguments:
  neighbors = tune()

Engine-Specific Arguments:
  scale = FALSE

Computational engine: kknn 

The normalization of the data is ensured through the following commands:

# Store metrics in variable
metrics_reg <- metric_set(rmse, mae, rsq_trad)

# Perform grid search using validation sets
knn_tune_res <- 
  knn_workflow %>% 
  tune_grid(resamples = data_folds,
            grid = knn_tune_grid, 
            metrics = metrics_reg) 

# Plot results metrics
knn_metrics_plot <- 
  knn_tune_res %>%  collect_metrics() %>% 
  ggplot(aes(x = neighbors, y = mean)) +
  geom_point(colour = "red") + geom_line(colour = "red") +
  facet_wrap(~ .metric, scale = "free_y") +
  labs(title = "KNN Performance Metrics") +
  theme(plot.title = element_text(hjust = 0.5)) 
knn_metrics_plot


autoplot(knn_tune_res)

Below the initial workflow for the k-nearest neighbors model is specified

# Generate best model
knn_best_model <- select_best(knn_tune_res, metric = "mae")

Tuning the number of nearest neighbours

The code below serves to specify the assessment metrics that are used. Moreover, a grid search is performed using the validation sets.

# Finalize workflow
knn_workflow_final <- 
  knn_workflow %>% 
  finalize_workflow(knn_best_model)
knn_workflow_final
== Workflow ====================================================================
Preprocessor: Recipe
Model: nearest_neighbor()

-- Preprocessor ----------------------------------------------------------------
2 Recipe Steps

* step_rm()
* step_normalize()

-- Model -----------------------------------------------------------------------
K-Nearest Neighbor Model Specification (regression)

Main Arguments:
  neighbors = 51

Engine-Specific Arguments:
  scale = FALSE

Computational engine: kknn 

The plot output shows some metrics that plot the mean of the performance metrics. We should aim for the MAE (mean absolute error) and RMSE (root mean square error) to be as low as possible, and the rsq_trad (R-squared) to be as high as possible. We used the MAE metric to determine the optimal k-neighbors for our model, which arrived at 51 neighbors. This can be read from the MAE graph, by looking at the corresponsing k-neighbors for the lowest mean of MAE.

Moreover, from the last plot the elbow trend can somewhat clearly be seen: the metrics reach their optimum point after which the level off and slowly increase for MAE and RMSE and decrease for rqs_trad.

The model with the optimal number of k-nearest neighbors can then be selected as follows:

# Train and test the data set
knn_last_fit <- 
  knn_workflow_final %>% 
  last_fit(data_split, 
           metrics = metrics_reg)

Finalize workflow

Below the finalized workflow is made, which automatically picks the best KNN-model defined above (which is specified by the MAE metric)

# Generate predicted values for sales
lasso_test_preds <- 
  lasso_wf_tuned %>% 
  fit(data = data_train) %>%
  predict(data_test) %>% 
  pull(.pred)

# Create tibble for distribution plot
lasso_pred <- 
  tibble(observed = data_test$price, 
         predicted = lasso_test_preds, 
         residual = observed - predicted)

# Plot distribution residuals
lasso_residual_plot <- 
  lasso_pred %>% 
  ggplot(aes(x = residual)) +
  geom_density(bw = 0.15, fill = "springgreen", alpha = 0.5) +
  geom_rug() +
  labs(title = "Lasso Regularized Regression Distribution Residuals") +
  theme(plot.title = element_text(hjust = 0.5)) +
  coord_cartesian(ylim = c(0, 1.5))
lasso_residual_plot


# Generate predicted values for sales
set.seed(12345)
rf_test_preds <- 
  rf_vi_wf %>% 
  fit(data = data_train) %>%
  predict(data_test) %>% 
  pull(.pred)

# Create tibble for distribution plot
rf_pred <- 
  tibble(observed = data_test$price, 
         predicted = rf_test_preds, 
         residual = observed - predicted)

# Plot distribution residuals
rf_residual_plot <- 
  rf_pred %>% 
  ggplot(aes(x = residual)) +
  geom_density(bw = 0.15, fill = "springgreen", alpha = 0.5) +
  geom_rug() +
  labs(title = "Random Forest Distribution Residuals") +
  theme(plot.title = element_text(hjust = 0.5)) +
  coord_cartesian(ylim = c(0, 1.5))
rf_residual_plot


# Generate predicted values for price
knn_test_preds <- 
  knn_workflow_final %>% 
  fit(data = data_train) %>%
  predict(data_test) %>% 
  pull(.pred)

# Create tibble for distribution plot
knn_pred <- 
  tibble(observed = data_test$price, 
         predicted = knn_test_preds, 
         residual = observed - predicted)

# Plot distribution residuals
knn_residual_plot <- 
  knn_pred %>% 
  ggplot(aes(x = residual)) +
  geom_density(bw = 0.15, fill = "springgreen", alpha = 0.5) +
  geom_rug() +
  labs(title = "KNN Distribution Residuals") +
  theme(plot.title = element_text(hjust = 0.5)) +
  coord_cartesian(ylim = c(0, 1.5))
knn_residual_plot

Last fit

A final workflow can be set up to check the final fit. Furthermore, the performance metrics for the best KNN-model are selected and put in a table.

# Generate table with 
lasso_metrics_compare <- 
  lasso_test_metrics %>% 
  select(-.estimator) %>% 
  mutate(model = "lasso regression")
rf_metrics_compare <- 
  final_res %>%
  collect_metrics() %>% 
  mutate(model = "random forest")
knn_metrics_compare <- 
  knn_last_fit %>% 
  collect_metrics %>% 
  select(-.estimator) %>% 
  mutate(model = "knn reg")
lasso_metrics_compare %>%
  bind_rows(rf_metrics_compare, knn_metrics_compare) %>% 
  select(-.estimator) %>% 
  pivot_wider(names_from = .metric, values_from = .estimate)

KNN variable importance

KNN, as a method, does not come with a prediction for the importance or coefficients of variables. The reason for this has to do with the fact that prediction in a k-nearest neighbor model relies exclusively on the distance between data points. With this comes the added implication that no information about the relative importance of variables can be derived from it.

Part 3: Model Comparison

In order to assess the performance of the three models the results of three metrics are compared.

  1. Root mean squared error. The objective is to minimize the result of this metric. An implication is that the metric is very sensitive to observations with large absolute residuals.
  2. Mean absolute error. The objective is to minimize the result of this metric. This metric is more robust and therefore less sensitive to observations with large absolute residuals. However, because it uses the mean, it is therefore not insensitive to skewed distributions.
  3. R-squared. An attraction is that the metric is unitless and can therefore be compared across models. A downside is that is not robust, since it essentially measures correlation and not agreement.

Assess appropriateness metrics

In order to check if the metrics are an appropriate fit, the residuals distributions of the models are plotted. Since the plots below show no large outliers and no skewed distributions, there are no implications for root mean squared error and mean absolute error.

# Generate predicted values for sales
lasso_test_preds <- 
  lasso_wf_tuned %>% 
  fit(data = data_train) %>%
  predict(data_test) %>% 
  pull(.pred)

# Create tibble for distribution plot
lasso_pred <- 
  tibble(observed = data_test$price, 
         predicted = lasso_test_preds, 
         residual = observed - predicted)

# Plot distribution residuals
lasso_residual_plot <- 
  lasso_pred %>% 
  ggplot(aes(x = residual)) +
  geom_density(bw = 0.15, fill = "springgreen", alpha = 0.5) +
  geom_rug() +
  labs(title = "Lasso Regularized Regression Distribution Residuals") +
  theme(plot.title = element_text(hjust = 0.5)) +
  coord_cartesian(ylim = c(0, 1.5))
lasso_residual_plot
# Generate predicted values for sales
set.seed(12345)
rf_test_preds <- 
  rf_vi_wf %>% 
  fit(data = data_train) %>%
  predict(data_test) %>% 
  pull(.pred)

# Create tibble for distribution plot
rf_pred <- 
  tibble(observed = data_test$price, 
         predicted = rf_test_preds, 
         residual = observed - predicted)

# Plot distribution residuals
rf_residual_plot <- 
  rf_pred %>% 
  ggplot(aes(x = residual)) +
  geom_density(bw = 0.15, fill = "springgreen", alpha = 0.5) +
  geom_rug() +
  labs(title = "Random Forest Distribution Residuals") +
  theme(plot.title = element_text(hjust = 0.5)) +
  coord_cartesian(ylim = c(0, 1.5))
rf_residual_plot
# Generate predicted values for price
knn_test_preds <- 
  knn_workflow_final %>% 
  fit(data = data_train) %>%
  predict(data_test) %>% 
  pull(.pred)

# Create tibble for distribution plot
knn_pred <- 
  tibble(observed = data_test$price, 
         predicted = knn_test_preds, 
         residual = observed - predicted)

# Plot distribution residuals
knn_residual_plot <- 
  knn_pred %>% 
  ggplot(aes(x = residual)) +
  geom_density(bw = 0.15, fill = "springgreen", alpha = 0.5) +
  geom_rug() +
  labs(title = "KNN Distribution Residuals") +
  theme(plot.title = element_text(hjust = 0.5)) +
  coord_cartesian(ylim = c(0, 1.5))
knn_residual_plot

Assessment metrics

The table below shows results for the metrics for the three models. When assessing the metrics and their objectives, the results show that the random forest model performs best on all metrics. Therefore, the random forest is considered to be the best model.

# Generate table with 
lasso_metrics_compare <- 
  lasso_test_metrics %>% 
  select(-.estimator) %>% 
  mutate(model = "lasso regression")
rf_metrics_compare <- 
  final_res %>%
  collect_metrics() %>% 
  mutate(model = "random forest")
knn_metrics_compare <- 
  knn_last_fit %>% 
  collect_metrics %>% 
  select(-.estimator) %>% 
  mutate(model = "knn reg")
lasso_metrics_compare %>%
  bind_rows(rf_metrics_compare, knn_metrics_compare) %>% 
  select(-.estimator) %>% 
  pivot_wider(names_from = .metric, values_from = .estimate)

Part 4: Recommendations

For this assignment we assessed three models for predicting prices on the Airbnb platform. These three models are a lasso regularized linear regression, a random forest and finally a K-Nearest Neighbours model. The models are assessed on three different metrics, namely the root mean squared error (RMSE), the mean squared error (MAE) and the r-squared. The proposed assessment metrics indicate that the random forest outperforms the lasso regularized linear regression and the K-Nearest-Neighbor in all proposed metrics.

Furthermore, the variable importance of the random forest indicates which variables lead to the highest increase in R-squared. Accommodates, bedrooms and room_type are the most important variables for predicting the price. Bed_type, pool, and wifi are the least important for predicting the price. We therefore advise hosts to specifically look at the variables with a high importance, as these variables have the highest impact on their predicted price.

As the model performance metrics were used to compare the models and to select the best one, these metrics should not be used to estimate the generalization error, or how well the model would perform on completely new data. For this purpose, it is recommended that the random forest model will be further tested in an online environment before scaling up its implementation. Additionally, in practice, the model will provide the prediction in log prices, and this has to be converted to euros. For example, a log price prediction of 4.09 would reflect price of €59.74.

How are these findings of use to Airbnb? Essentially, this project has sought to examine three approaches to competitively set listing prices of Airbnb listings. Considering our results, we advise Airbnb to use a random forest model in their projects that deal with optimal pricing of listings.

---
title: "MLLA - Group Assignment"
author:
- Goswin Nibbering - 563285
- Kylian van Noordenne - 450752
- Noora Mattsson - 477100
- Eline van Groningen - 414846
date: "11/25/2020"
output:
  html_document:
    df_print: paged
---

# Introduction

The aim of this report is to develop a predictive model, based on past Airbnb listings in Amsterdam, to help new Airbnb users set competitive prices for their listings. To do so, three different models are considered. These are lasso regularized linear regression, k-nearest-neighbors, and random forest model. The best model is selected based on the predictive performance on the test set. For this purpose, the dataset is split into a training and testing set. Furthermore, each of these models requires model tuning, which is completed using 10-k cross-validation on the training set. 

In the following sections, the data cleaning and feature selection process is explained, after which all the three models are tuned and tested. After this, the performance of each model is assessed, and the best model is selected. Finally, the last section discusses the recommendations based on the findings.

# Setup notebook

The following libraries are used for this notebook:
```{r message = FALSE, warning = FALSE}
# Load libraries
library(tidyverse)
library(tidymodels)
library(glmnet)
library(leaps)
library(naniar)
library(skimr)
library(knitr)
library(corrplot)
library(ranger)
library(doParallel)
library(themis)
library(vip)
```

# Load data listings
```{r message = FALSE, warning = FALSE}
# Read csv with listing information
data <- read_csv(gzfile("listings.csv.gz"))
```

# Part 1: Data Cleaning and Feature Selection

## Select variables of interest

An important implication for variable selection is to avoid data leakages. You have data leakage when a feature is used for in the predictive models, that at the time of the prediction cannot be available. All variables including information on the reviews provide information about a listing after it has been published. Including this variables could be a potential source of data leakage. Therefore, these variables are not included in the data set. Moreover, the variables including a description and summary about the listing can be analyzed using NLP (e.g. sentiment analysis). However, this is beyond the scope of the assignment. Therefore, these variables are excluded from the model. The variables below are included in the data set for further analysis and cleaning.
```{r message = FALSE}
# Generate subset with variables of interest
data_sub <- data %>%
  select(id, price, property_type, room_type, accommodates, bathrooms, bedrooms,
         beds, bed_type, amenities, host_since, host_response_time,
         host_response_rate, host_neighbourhood, host_listings_count, 
         host_verifications, host_identity_verified, neighbourhood_cleansed,
         square_feet, cleaning_fee, guests_included, extra_people, 
         minimum_nights, maximum_nights, availability_30, availability_60,
         availability_90, availability_365, instant_bookable, 
         cancellation_policy, require_guest_profile_picture,
         require_guest_phone_verification, calculated_host_listings_count,
         calculated_host_listings_count_entire_homes, 
         calculated_host_listings_count_private_rooms,
         calculated_host_listings_count_shared_rooms)
```

```{r}
# Inspect data
head(data_sub)
```

```{r}
# Inspect data
skim(data_sub) %>% knit_print()
```

## Data cleaning

### Basic cleaning

First, we converted all categorical and logical variables, which did not need any further cleaning, to data type factor.
```{r Converting vectors}
# Convert categorical vars to factors 
data_sub$property_type <- factor(data_sub$property_type , 
                                 levels = unique(data_sub$property_type))
data_sub$room_type <- factor(data_sub$room_type , 
                             levels = unique(data_sub$room_type))
data_sub$bed_type <- factor(data_sub$bed_type , 
                    levels = unique(data_sub$bed_type))
data_sub$host_response_time <- factor(data_sub$ host_response_time, 
                    levels = unique(data_sub$host_response_time))
data_sub$host_neighbourhood <- factor(data_sub$host_neighbourhood, 
                    levels = unique(data_sub$host_neighbourhood))
data_sub$neighbourhood_cleansed <- factor(data_sub$neighbourhood_cleansed, 
                    levels = unique(data_sub$neighbourhood_cleansed))
data_sub$cancellation_policy <- factor(data_sub$cancellation_policy , 
                    levels = unique(data_sub$cancellation_policy))

# Convert logical variables to factors
data_sub$host_identity_verified <- factor(data_sub$host_identity_verified)
data_sub$instant_bookable <- factor(data_sub$instant_bookable)
data_sub$require_guest_profile_picture <- 
  factor(data_sub$require_guest_profile_picture)
data_sub$require_guest_phone_verification <- 
  factor(data_sub$require_guest_phone_verification)
```

Second, some of the variables contain numeric variables, however, they are stored in a string containing a dollar or percentage sign. The signs were removed from the strings, and the remaining numbers were converted to numeric variables.
```{r Convert }
# Remove $ sign from columns containing prices and convert to doubles
data_sub$price <- as.double(gsub("[,$]", "", data_sub$price))
data_sub$cleaning_fee <- as.double(gsub("[,$]", "", data_sub$cleaning_fee))
data_sub$extra_people <- as.double(gsub("[,$]", "", data_sub$extra_people))

# Replace "N/A" values, remove % and convert to percentage 
data_sub$host_response_rate <- na_if(data_sub$host_response_rate, "N/A")
data_sub$host_response_rate <- 
  as.double(gsub("[%]", "", data_sub$host_response_rate)) / 100
```

### Clean amenities 

The variable *amenities* contains all the amenities of the listing. However, this was stored in one large string, and the elements could not be accessed separately. Therefore, we cleaned the string and splitted it to a list containing the separated elements. Furthermore, we extracted all unique amenities and stored these in a vector. 
```{r}
# Clean and split strings for amenities
# Returns a list with all unique values
clean_amenities <- function(x) {
  subbed <- gsub('[{}\"]', "", tolower(x))
  splitted <- str_split(subbed, ",")
  clean <- sapply(splitted, function(x) str_trim(x, side = "both"))
  return(clean)
}

# Clean amenities
data_sub$amenities_clean <- 
  sapply(data_sub$amenities, function(x) clean_amenities(x))

# Create vector with all unique amenities
amenities_unique = c()
for(amenities in data_sub$amenities_clean) {
  for(element in amenities) {
    if(!(element %in% amenities_unique) & element != "") {
      amenities_unique <- append(amenities_unique, str_trim(element))
    }
  }
}
amenities_unique
```

#### Create new variables based information stored in amenities

Separate variables could be created from the vector containing all unique amenities.  However, since there are 130 usable amenities it seemed beyond the scope of the assignment. Moreover, some of the amenities contain information that is also given by other variables or about other amenities. For example a private bathroom could also be a strong indicator that the listing is an entire house/apartment. Also, shampoo or shower gel could be strong indicators for the presence of a bathroom. Furthermore, some amenities are very specific and apply only to a few or one house. Yet, variables are created for some of the amenities that could have an influence on the price of a listing. We created variables for *wifi*, *pool*, *hot_tub* and *tv*. 
```{r Create new variables for amenities}
# Create variable for WIFI and add to data set
wifi <- vector()
for(i in 1:length(data_sub$amenities_clean)) {
  if("wifi" %in% data_sub$amenities_clean[[i]] | 
     "internet" %in% data_sub$amenities_clean[[i]]) {
    wifi[i] <-  "yes"
  } else {
    wifi[i] <-  "no"
  }
}
data_sub$wifi <- wifi
data_sub$wifi <- factor(data_sub$wifi, levels = c("yes", "no"))

# Create variable for pool and add to data set
pool <- vector()
for(i in 1:length(data_sub$amenities_clean)) {
  if("pool" %in% data_sub$amenities_clean[[i]] |
     "pool with pool hoist" %in% data_sub$amenities_clean[[i]]) {
    pool[i] <-  "yes"
  } else {
    pool[i] <-  "no"
  }
}
data_sub$pool <- pool
data_sub$pool <- factor(data_sub$pool, levels = c("yes", "no"))

# Create variable for hot_tub and add to data set
hot_tub <- vector()
for(i in 1:length(data_sub$amenities_clean)) {
  if("hot tub" %in% data_sub$amenities_clean[[i]]) {
    hot_tub[i] <-  "yes"
  } else {
    hot_tub[i] <-  "no"
  }
}
data_sub$hot_tub <- hot_tub
data_sub$hot_tub <- factor(data_sub$hot_tub, levels = c("yes", "no"))

# Create variable for hot_tub and add to data set
tv <- vector()
for(i in 1:length(data_sub$amenities_clean)) {
  if("tv" %in% data_sub$amenities_clean[[i]] |
     "cable tv" %in% data_sub$amenities_clean[[i]]) {
    tv[i] <-  "yes"
  } else {
    tv[i] <-  "no"
  }
}
data_sub$tv <- tv
data_sub$tv <- factor(data_sub$tv, levels = c("yes", "no"))
```

### Clean host verification methods

The same reasoning from the *amenities* section above applies to the variable *host_verification*. The same cleaning method was used as for amenities, first we cleaned and splitted the strings, thereafter a vector with all unique host verification methods is generated.
```{r Generate unique host verification methods}
# Clean and split strings for host verification methods
# Returns a list with all unique values
clean_verficiations <- function(x) {
  subbed <- gsub("\\[|\\]", "", tolower(x))
  subbed_complete <- gsub("[']", "", subbed)
  splitted <- str_split(subbed_complete, ",")
  clean <- sapply(splitted, function(x) str_trim(x, side = "both"))
  return(clean)
}

# Clean host_verifications
data_sub$host_verifications_clean <- 
  sapply(data_sub$host_verifications, function(x) clean_verficiations(x))

# Generate list with all unique host verification methods
verifications_unique = c()
for(verifications in data_sub$host_verifications_clean) {
  for(element in verifications) {
    if(!(element %in% verifications_unique) & element != "") {
      verifications_unique <- append(verifications_unique, str_trim(element))
    }
  }
}
verifications_unique
```

#### Create variables based on information stored in host verifications

Separate variables are created for the most common methods of verification, namely *email*, *phone*, *facebook* and *government_id*. 
```{r Create new variables for host verification method}
# Create variable for host email and add to data set
host_email <- vector()
for(i in 1:length(data_sub$host_verifications_clean)) {
  if("email" %in% data_sub$host_verifications_clean[[i]]) {
    host_email[i] <-  "yes"
  } else {
    host_email[i] <-  "no"
  }
}
data_sub$host_email <- host_email 
data_sub$host_email <- factor(data_sub$host_email, levels = c("yes", "no"))

# Create variable for phone and add to data set 
host_phone <- vector()
for(i in 1:length(data_sub$host_verifications_clean)) {
  if("phone" %in% data_sub$host_verifications_clean[[i]]) {
    host_phone[i] <-  "yes"
  } else {
    host_phone[i] <-  "no"
  }
}
data_sub$host_phone <- host_phone 
data_sub$host_phone <- factor(data_sub$host_phone, levels = c("yes", "no"))
 
# Create variable for host facebook and add to data set
host_facebook <- vector()
for(i in 1:length(data_sub$host_verifications_clean)) {
  if("facebook" %in% data_sub$host_verifications_clean[[i]]) {
    host_facebook[i] <-  "yes"
  } else {
    host_facebook[i] <-  "no"
  }
}
data_sub$host_facebook <- host_facebook 
data_sub$host_facebook <- 
  factor(data_sub$host_facebook, levels = c("yes", "no"))

# Create variable for government id 
host_government_id <- vector()
for(i in 1:length(data_sub$host_verifications_clean)) {
  if("government_id" %in% data_sub$host_verifications_clean[[i]]) {
    host_government_id[i] <-  "yes"
  } else {
    host_government_id[i] <-  "no"
  }
}
data_sub$host_government_id <- host_government_id 
data_sub$host_government_id <- 
  factor(data_sub$host_government_id, levels = c("yes", "no"))
```

### Clean date variables

We used the variable *host_since* to create a new variable *host_years_active*, which contains information regarding the number of years a host has been active on the platform. 
```{r Create new variable}
# Create new variable for active years host 
data_sub <- data_sub %>% 
  mutate(host_years_active = 
           as.double(as.Date("2019-12-07") - host_since) / 365)
```

### Inspect availability variables

The variables *availability_30*, *availability_60*, *availability_90* and *availability_365* carry some of the same information. In order to inspect if all variables should be included in the data set, we plotted a correlation matrix. The plot below shows that all variables indicating the availability are strongly correlated. Therefore, only the variable *availability_30* is included for further analysis. 
```{r}
# Plot correlation matrix
data_sub %>% select_if(is.numeric) %>% 
  select(availability_30, availability_60, availability_90, availability_365) %>% 
  cor() %>% corrplot()

# Remove other availability variables from data set
data_sub <- 
  data_sub %>% 
  select(-availability_60, -availability_90, -availability_365) 
```

### Check missing data

In order to inspect which variables have missing cases, and how many there are in general, a table is constructed. The table shows the variables in the data set which contain missing values (in descending order). The table shows that the variable *square_feet* has $19662$ missing cases, which is about $98.19\%$. If we would deleted the missing cases, the data set will barely contain any data. Moreover, other methods for handling missing values, like replacing NA-values with the mean or median, would not be appropriate as the variables will be based on only $1.81\%$ of the data. Therefore, *square_feet* is not included in the final. The variable *host_response_rate* has $9349$ missing cases, which is about $46.69\%$. The variable *host_neighbourhood* has $5972$, which is about $29.82\%$. The variable *cleaning_fee* has $3604$, which is about $18.00\%$. The variables  *host_response_rate*, *host_neighbourhood* and *cleaning_fee* do not have as many missing values as *square_feet*, however, the same reasoning applies. Due to these findings, these variables are also excluded from the analyses. 
```{r Missing data check}
# Count missing cases per variable
na_counter <- sapply(data_sub, function(x) sum(is.na(x)))
vars <- colnames(data_sub)

# Extract all variables with NA-values
na_values <- tibble(variables = vars, na_count = na_counter) 

# Check na count per variable
na_values %>%
  filter(na_count > 0) %>%
  arrange(desc(na_count))
```

Moreover, a check is performed on the number of cases that contain missing values if the other values that have missing cases would be included in the 'final' data set. The table below shows that there are 206 cases which contain missing values, which is about $1.03\%$ of the entire data set. Since this is a very small proportion it is not very likely that deleting these cases would have a large impact on the predictions. Furthermore, all models that will be performed cannot handle missing values. Therefore, the cases with missing values are deleted. 
```{r Incomplete cases}
# Compute incomplete rows
data_sub %>% 
  select(host_since, host_response_time, host_listings_count,
         host_identity_verified, beds, bedrooms, host_years_active, 
         bathrooms) %>% 
  complete.cases() %>% 
  summary(count())
```

```{r Generate final data set}
# Select variables for data set
variables_analysis <-   
  na_values %>% 
  filter(na_count <= 158) %>% 
  select(variables) %>% 
  pull(variables)

# Create final data set
data_semi_final <- data_sub %>% select(all_of(variables_analysis))
data_semi_final <- data_semi_final %>% 
  select(-c(amenities, amenities_clean, host_verifications,
            host_verifications_clean, host_since))
data_final <- data_semi_final[complete.cases(data_semi_final), ]
```

Here is a table describing the features included in the final data set:

| Variable Name                                   | Variable Description                                                                                     
|-------------------------------------------------|----------------------------------------------------------------------------------------------------------
| id                                              | Listing ID
| price                                           | Log transformed price listing (€)                                                        
| property_type                                   | Type of property that is rented out
| room_type                                       | Type of room that is rented out
| accommodates                                    | Number of people that can be accommodated in the rented property
| bathrooms                                       | Number of bathrooms in the property
| bedrooms                                        | Number of bedrooms in the property
| beds                                            | Number of beds in the property
| bed_type                                        | Type of bed present within the property
| host_listings_count                             | Number of total listings of a host
| host_response_time                              | Average time it takes for a host to respond
| host_identity_verified                          | Whether a host verified his identify on Airbnb 
| neighbourhood_cleansed                          | Neighborhood location of the listed property
| guests_included                                 | Number of guests included within a listing 
| extra_people                                    | Number of extra people allowed at the listing
| minimum_nights                                  | The minimum amount of nights that a guest has to book regarding this listing
| maximum_nights                                  | The maximum amount of nights that a guest is allowed to book regarding this listing
| availability_30                                 | The amount of days for which a particular host is available in 30 days
| instant_bookable                                | Whether guests can directly book without a inquiry
| cancellation_policy                             | The set policy regarding guest cancellation 
| require_guest_profile_picture                   | Whether the potential guest needs a profile picture in order to book the listing
| require_guest_phone_verification                | Whether the potential guest needs a phone verification in order to book the listing
| calculated_host_listings_count                  | The number of total host listings
| calculated_host_listings_count_entire_homes     | The number of total entire home listings of a specific host
| calculated_host_listings_count_private_rooms    | The number of total private room listings of a specific host
| calculated_host_listings_count_shared_rooms     | The number of total shared room listings of a specific host
| wifi                                            | Whether WiFi is present within the property
| pool                                            | Whether a pool is present within the property 
| hot_tub                                         | Whether a hot tub is present within the property
| tv                                              | Whether a tv is present within the property 
| host_email                                      | Whether the host verified his email on Airbnb
| host_phone                                      | Whether the host verified his phone on Airbnb
| host_facebook                                   | Whether the host verified his Facebook account on Airbnb
| host_government_id                              | Whether the host verified his government ID on Airbnb
| host_years_active                               | Number of years the host is active on Airbnb


## Create train-test split

For further analysis the data is split into a train-test set.
```{r}
# Create a train-split sets
seed_x <-  123
set.seed(seed_x)
data_split <- initial_split(data_final, prop = 0.7)
data_train <- training(data_split)
data_test <- testing(data_split)
```

### Inspecting the predicted variable

In order to prevent data leakages we only inspect the predicted variable *price* in the training set. In order to inspect *price* we have created a distribution plot. The plot shows that the data set contains some outliers and that the distribution is rightly skewed. Both the outliers and the skeweness make the data less interpretable and this could have an influence on performance of the models. 
```{r}
# Plot distribution price
ggplot(data = data_train , aes(price)) +
  geom_histogram(col="black",
                 breaks=seq(0, max(data_train$price), by=75),
                 aes(fill=..count..)) +
  labs(title="Distribution for Price", x="Price", y="Count") +
  scale_fill_gradient("Count", low="green", high="red") +
  theme(plot.title = element_text(hjust = 0.5))
```

In order to prevent these potential problems we created a new distribution plot with a log transformed variable *price*. The plot shows that the distribution is less skewed and does not contain many large outliers. Consequently, the data is more interpretable. Therefore, we will use the log transformed *price* for our models. 
```{r}
# Plot distribution price
ggplot(data = data_train , aes(log(price + 1))) +
  geom_histogram(col="black",
                 aes(fill=..count..)) +
  labs(title="Distribution for Ln Price", x="Ln Price", y="Count") +
  scale_fill_gradient("Count", low="green", high="red") +
  theme(plot.title = element_text(hjust = 0.5))
```

```{r}
# Log transform the price in for both training and test set
data_final$price <- log(data_final$price + 1)

# Resplit the data using the same seed 
set.seed(seed_x)
data_split <- initial_split(data_final, prop = 0.7)
data_train <- training(data_split)
data_test <- testing(data_split)

```

### K-fold cross validation 

Moreover, we have generated 10-fold cross validation sets.
```{r}
# Generate 10-fold CV sets
set.seed(321)
data_folds <- vfold_cv(data_train, v = 10)
data_folds
```

### Remove data frames to avoid leakages and errors
```{r}
rm(data)
rm(data_sub)
rm(data_semi_final)
rm(na_values)
```

# Part 2: Modelling Approaches

## Linear Lasso Regularized Regression Model

In this section, a regularized regression model will be specified and trained. A lasso penalty is chosen to simultaneously perform subset selection. Therefore, a mixture is set to 1 in the model specification. 

### Model specification

Specification of lasso-regularized logistic regression model, where the penalty parameter will be tuned:
```{r results='hide'}
# Specify the model and engine used
lasso_linreg <- linear_reg(penalty = tune(), mixture = 1) %>% 
  set_engine("glmnet")

# Check that model specified correctly:
lasso_linreg %>% translate()
```

### Preprocessing recipe

In this section, the recipe is formulated. All the final dataset variables are included in the recipe, in order to perform the subset selection by means of a lasso penalty. As the property type and bed type have some categories with just a few observations, the categories that include less than 1% of the total number of observations are combined to "other" category to avoid sparse data. Additionally, dummies are created for all of the nominal variables. Lastly, all the variables are normalized.
```{r results='hide'}
# Prepare the recipe by setting up the regression model, setting id as id variable, combining small categories to other class, and creating dummies and normalizing variables
lasso_recipe <-  recipe(price ~ ., 
                          data = data_train) %>% 
                        update_role(id, new_role = "ID") %>%
                        step_other(property_type, bed_type,  threshold = 0.01, other = "other values") %>% 
                        step_dummy(all_nominal(), -all_outcomes()) %>%
                        step_normalize(all_predictors(), -all_outcomes())
lasso_recipe
```

Testing that this works properly:
```{r}
# Prepare and bake the data (on training set) to check that the recipe prepares the data correctly
data_baked <- lasso_recipe %>% prep(data_train) %>% bake(data_train)
head(data_baked)
```

### Create Lasso Workflow
```{r results='hide'}
# Combine the model specification and reciple to a workflow
lasso_wf <- workflow() %>% 
  add_recipe(lasso_recipe) %>% 
  add_model(lasso_linreg)
lasso_wf
```

### Tuning grids
Next, the $\lambda$ parameter of the lasso model will be tuned. For that purpose, a tuning grid is specified. 
```{r}
# Set tuning grid
grid_lasso <- tibble(penalty = 10^(seq(from = -5, to = 1, length.out = 70)))
```

### Tuning lasso-penalized linear regression
10-k-cross-validation is used to tune the lasso-penalized linear regression, and the metrics are plotted against the different values of $\lambda$.
```{r}
 # Perform grid search over the tuning grid of penalty values
 lasso_tune <- lasso_wf %>% 
  tune_grid(resamples = data_folds, 
            grid = grid_lasso,
            metrics = metric_set(mae, rmse, rsq_trad))
```

```{r message=TRUE}
# Store metrics in variable
lasso_tune_metrics <- lasso_tune %>% 
  collect_metrics()

# Plot all results metrics 
lasso_tune_metrics %>%
  ggplot(aes(x = penalty, y = mean, 
             ymin = mean - std_err, ymax = mean + std_err)) + 
  geom_linerange(alpha = 0.5, colour = "red") + 
  geom_point(colour = "red") + 
   facet_wrap(~ .metric, scale = "free_y") +
  scale_x_log10() + 
  labs(y = "Lasso Performance Metrics", x = expression(lambda))

# Plot MAE 
lasso_tune_metrics %>% filter(.metric == "mae") %>% 
  ggplot(aes(x = penalty, y = mean, 
             ymin = mean - std_err, ymax = mean + std_err)) + 
  geom_linerange(alpha = 0.5, colour = "red") + 
  geom_point(colour = "red") + 
  scale_x_log10() + 
  labs(y = "mae", x = expression(lambda), 
       title = "Lasso Regresssion MAE") +
  theme(plot.title = element_text(hjust = 0.5))
```

Next, the Lambda value which results in best model performance on the train set is selected. It can be seen that as the RMSE is more sensitive for large residuals, the standard errors of these metrics are larger compared to the standard errors of mean absolute error (MAE). Therefore, the mean absolute error is used to select the best model. 
```{r}
# Show best models with corresponding penalty values
lasso_tune %>% show_best("mae")
```

The best model is selected using the one standard error rule, where the simplest model that has MAE inside one standard error from the absolute best model is chosen to avoid overfitting.
```{r}
# Select best model according to 1 std error rule
lasso_1se_model <- select_by_one_std_err(lasso_tune, metric = "mae", desc(penalty))
lasso_1se_model
```

As can be seen, the best model has a penalty parameter of 0.007.

Finalize the workflow:
```{r, results='hide'}
# Finalize lasso wf with the selected best model
lasso_wf_tuned <- 
  lasso_wf %>% 
  finalize_workflow(lasso_1se_model)
lasso_wf_tuned
```

```{r}
# Train the tuned model on all of the train data and test on the test data 
lasso_last_fit <- lasso_wf_tuned %>% 
  last_fit(data_split, metrics = metric_set(mae, rmse, rsq_trad))
```

The performance on the test set for this model is:
```{r}
# Collect metrics from the model on the test set
lasso_test_metrics <- lasso_last_fit %>% collect_metrics()
lasso_test_metrics
```
As seen above, the final lasso model has mean absolute error of 0.27, root mean squared error of 0.37 and R squared on 48,7% on the test data.

To assess the importance of the predictor variables, model parameter estimates are calculated below:
```{r}
# Fit the model on the training data and pull the model coefficients for the variables
lasso_wf_tuned %>% fit(data_train) %>% pull_workflow_fit() %>% tidy() 
```

As lasso performs subset selection automatically, some variables have a coefficient of zero. There are multiple variables with coefficient of zero, which implies that these variables are less important for the price prediction of new Airbnb listings. The most important variables can be identified by looking at the coefficients as well, and the 4 most important variables are number of accommodates, the number of days that the airbnb is available inside 30 days, room type of entire home apartment, and lastly, Centrum-West neighbourhood.

# Random Forest

### Random forest specification 

#### Recipe
Within this section, a random forest will be created. First a preprocessing recipe is created. 
The id variable is updated to a seperate role, instead of being a predictor. 
```{r Recipe}
# Specify recipe
rf_recipe <- recipe(price ~ ., data = data_train) %>%
  update_role(id, new_role = "id var")
rf_recipe
```

#### Tune specifications

Within this section the tune specification are mentioned. The *mtry* is the number of features that are used at each split. The exact mtry value will be tuned later on. Different values for trees where tested (200, 500 & 1000). Increasing the amount of trees did not have much impact on the results. Therefore, a tree size of 200 is chosen to save computational time. 
```{r message=FALSE}
# Tune specification
rf_tune_spec <- rand_forest(mtry = tune(), trees = 200) %>%
  set_engine("ranger") %>%
  set_mode("regression")
```

The recipe and the model are combined into a workflow that can be tuned.
```{r message=FALSE}
# Workflow creation
rf_tune_wf <- workflow() %>%
  add_recipe(rf_recipe) %>%
  add_model(rf_tune_spec)
```

A metric set is specified, which calculates the *Root Mean Square Error (RMSE)*, *the Mean Absolute Error (MAE)* and the *R-squared (rsq_trad)* is created. 
```{r Class metrics}
# Class metrics specification 
class_metrics <- metric_set(rmse, mae, rsq_trad)
```

The command bellow allows us to do computations in parallel.
```{r message=FALSE}
registerDoParallel()
```

The command grid = tibble(mtry = 1:33) was utlised, as this allows the random forest to consider all number of variables at each split. Based on the MAE criteria, a mtry of 5, 6, 7, 8 & 9 was found as the optimal solution. Afterwards, a mtry of c(1:10)) is taken that will include the optimal values, as well mtry values of 1 up until 4 and a mtry of 10. This allows us to see that the mtry is initially increased up until it reaches its optimal mtry solution.  
```{r message=FALSE}
# Define the tune grid 
rf_tunegrid <- tibble(mtry = c(1:10))
```

```{r Grid}
# Tune the grid
set.seed(12345)
rf_tune_res <- tune_grid(
  rf_tune_wf,
  resamples = data_folds,
  grid = rf_tunegrid,
  metrics = class_metrics
)
rf_tune_res
```

### Selecting tuning parameters 
```{r Collect metrics}
# Collect metrics
rf_tune_res %>%
  collect_metrics()
```

This plot illustrates the best mtry, based on the criteria of the MAE. A lower MAE would indicate a better result, as a lower value indicates a lower error of prediction. 
```{r Plot MAE}
# Plot results all metrics
rf_tune_res %>%
  collect_metrics() %>%
  filter(.metric %in% c("rmse", "mae", "rsq_trad")) %>%
  ggplot(aes(x = mtry, y = mean, ymin = mean - std_err, ymax = mean + std_err, 
             colour = .metric)) +
  geom_errorbar() + 
  geom_line() +
  geom_point() +
  facet_grid(.metric ~ ., scales = "free_y") +
  labs(title = "Random Forest performance metrics") +
  theme(plot.title = element_text(hjust = 0.5))
```

```{r}
# Plot the MAE based
rf_tune_res %>%
  collect_metrics() %>%
  filter(.metric == "mae") %>% 
  ggplot(aes(x = mtry, y = mean, ymin = mean - std_err, ymax = mean + std_err)) +
  geom_errorbar(colour = "red") + 
  geom_line(colour = "red") +
  geom_point(colour = "red") +
  labs(y = "mae", title = "Random Forest performance metrics") +
  theme(plot.title = element_text(hjust = 0.5))
```

This command will show the best mtry based on the MAE metrics. 
```{r Best MAE}
# Find the mtry with the best MAE
rf_tune_res %>% show_best("mae")
```

### Best model selection

The best model based on the MAE criteria is selected and eventually finalized into the workflow. 
```{r Best model}
# Best model selection
best_rmse <- select_best(rf_tune_res, "mae")
final_rf <- finalize_workflow(rf_tune_wf, best_rmse)
final_rf
```

#### Test set performance

Now we can train the finalized workflow on our entire training set
```{r Finalise}
# Finalize workflow on training set
final_res <- final_rf %>%
  last_fit(data_split, metrics = class_metrics)
```

The results based on the test set will be 
```{r Test results}
# Score on test data
set.seed(54321)
final_res %>%
  collect_metrics()
```

#### Variable importance 

Now we try to asses the variable importance. We will refit the model based on our previous tune parameters. We previousyly found an optimal mtry of 7, that's why the mtry is specified as 7. However, do keep in mind that because of the random element within a random forest, that this initial value might alter. We noticed that the optimal mtry switches between 6, 7 & 8. 
```{r Refit}
# Refit the model
rf_model_vi <- rand_forest(mtry = 7, trees = 200) %>%
  set_engine("ranger", importance = "permutation")

rf_vi_wf <- workflow() %>% 
  add_model(rf_model_vi) %>% 
  add_recipe(rf_recipe)

# Fit the model again
set.seed(12345)
rf_vi_fit <- rf_vi_wf %>% fit(data = data_train)
```

We can use the refitted model in order the gather the variable importance 
```{r Variable importance}
# Variable importance 
rf_vi_fit %>% pull_workflow_fit() %>% vi()
```

The variable importance indicates that the accommodates, bedrooms and room_type are the most important variables for predicting the price. The variables which are the least important for predicting the price are bed_type, pool, and wifi. Pool and wifi actually have a negative importance, but as this is close to a value of 0, it is chosen to still include those variables.

```{r}
# Plot variable importance
var_importance_plot <-
  rf_vi_fit %>%
  pull_workflow_fit() %>% vip(geom = "point", num_features = 12) +
  labs(title = "Random Forest Variable Importance") +
  theme(plot.title = element_text(hjust = 0.5)) 
rf_vi_fit

# Save plot for presentation
ggsave("plots/rf_var_importance.png", plot = var_importance_plot,
       height = 7 , width = 10)

```

## K-Nearest-Neighbors

### Set up turning grid
```{r}
# Generate tuning grid for knn
knn_tune_grid <- tibble(neighbors = 1:50*2-1)
```

### Specify a workflow 

Something that should be noted for this recipe is that only numeric variables are included. This is done for the reason that categorical variables translate with difficulty to a k-nearest neighbor algorithm. The premise of prediction based on a KNN-model is that it relies exclusively on the distance between points in the data. This distance is obvious when handling numeric variables. However, when dealing with non-numeric values and variables this distance between data points cannot easily be modeled, provided they should be modeled at all. (This will have implications for determining predictions for importance and coefficients for variables, which will be addressed at the end of the section on the KNN-model).
```{r}
# Specify model 
knn_mod <- 
  nearest_neighbor(neighbors = tune()) %>% 
  set_mode("regression") %>% 
  set_engine("kknn", scale=FALSE)
knn_mod

# Specify recipe
knn_recipe <- 
  recipe(price ~ ., data = data_train) %>% 
  step_rm(all_nominal()) %>%
  update_role(id, new_role = "id var") %>% 
  step_normalize(all_predictors(), -id)
knn_recipe
```

The normalization of the data is ensured through the following commands:
```{r}
# Check normalization 
train_baked <- knn_recipe %>% prep(data_train) %>% bake(data_train)
train_baked %>% head()
round(colMeans(train_baked, 8))
round(apply(train_baked, 2, sd), 8)
rm(train_baked)
```

Below the initial workflow for the k-nearest neighbors model is specified
```{r}
# Specify workflow
knn_workflow <- 
  workflow() %>% 
  add_model(knn_mod) %>% 
  add_recipe(knn_recipe)
knn_workflow
```

### Tuning the number of nearest neighbours

The code below serves to specify the assessment metrics that are used. Moreover, a grid search is performed using the validation sets. 
```{r message = FALSE}
# Store metrics in variable
metrics_reg <- metric_set(rmse, mae, rsq_trad)

# Perform grid search using validation sets
knn_tune_res <- 
  knn_workflow %>% 
  tune_grid(resamples = data_folds,
            grid = knn_tune_grid, 
            metrics = metrics_reg) 

# Plot results metrics
knn_metrics_plot <- 
  knn_tune_res %>%  collect_metrics() %>% 
  ggplot(aes(x = neighbors, y = mean)) +
  geom_point(colour = "red") + geom_line(colour = "red") +
  facet_wrap(~ .metric, scale = "free_y") +
  labs(title = "KNN Performance Metrics") +
  theme(plot.title = element_text(hjust = 0.5)) 
knn_metrics_plot
```

The plot output shows some metrics that plot the mean of the performance metrics. We should aim for the MAE (mean absolute error) and RMSE (root mean square error) to be as low as possible, and the rsq_trad (R-squared) to be as high as possible. We used the MAE metric to determine the optimal k-neighbors for our model, which arrived at 51 neighbors. This can be read from the MAE graph, by looking at the corresponsing k-neighbors for the lowest mean of MAE. 

Moreover, from the last plot the elbow trend can somewhat clearly be seen: the metrics reach their optimum point after which the level off and slowly increase for MAE and RMSE and decrease for rqs_trad.

The model with the optimal number of k-nearest neighbors can then be selected as follows:
```{r}
# Generate best model
knn_best_model <- select_best(knn_tune_res, metric = "mae")
```

### Finalize workflow

Below the finalized workflow is made, which automatically picks the best KNN-model defined above (which is specified by the MAE metric)
```{r}
# Finalize workflow
knn_workflow_final <- 
  knn_workflow %>% 
  finalize_workflow(knn_best_model)
knn_workflow_final
```

### Last fit 

A final workflow can be set up to check the final fit. Furthermore, the performance metrics for the best KNN-model are selected and put in a table.
```{r}
# Train and test the data set
knn_last_fit <- 
  knn_workflow_final %>% 
  last_fit(data_split, 
           metrics = metrics_reg)
```

### KNN variable importance 

KNN, as a method, does not come with a prediction for the importance or coefficients of variables. The reason for this has to do with the fact that prediction in a k-nearest neighbor model relies exclusively on the distance between data points. With this comes the added implication that no information about the relative importance of variables can be derived from it.


# Part 3: Model Comparison

In order to assess the performance of the three models the results of three metrics are compared. 

1. Root mean squared error. The objective is to minimize the result of this metric. An implication is that the metric is very sensitive to observations with large absolute residuals. 
2. Mean absolute error. The objective is to minimize the result of this metric. This metric is more robust and therefore less sensitive to observations with large absolute residuals. However, because it uses the mean, it is therefore not insensitive to skewed distributions. 
3. R-squared. An attraction is that the metric is unitless and can therefore be compared across models. A downside is that is not robust, since it essentially measures correlation and not agreement. 

## Assess appropriateness metrics 

In order to check if the metrics are an appropriate fit, the residuals distributions of the models are plotted. Since the plots below show no large outliers and no skewed distributions, there are no implications for root mean squared error and mean absolute error. 
```{r echo = TRUE}
# Generate predicted values for sales
lasso_test_preds <- 
  lasso_wf_tuned %>% 
  fit(data = data_train) %>%
  predict(data_test) %>% 
  pull(.pred)

# Create tibble for distribution plot
lasso_pred <- 
  tibble(observed = data_test$price, 
         predicted = lasso_test_preds, 
         residual = observed - predicted)

# Plot distribution residuals
lasso_residual_plot <- 
  lasso_pred %>% 
  ggplot(aes(x = residual)) +
  geom_density(bw = 0.15, fill = "springgreen", alpha = 0.5) +
  geom_rug() +
  labs(title = "Lasso Regularized Regression Distribution Residuals") +
  theme(plot.title = element_text(hjust = 0.5)) +
  coord_cartesian(ylim = c(0, 1.5))
lasso_residual_plot
```

```{r}
# Generate predicted values for sales
set.seed(12345)
rf_test_preds <- 
  rf_vi_wf %>% 
  fit(data = data_train) %>%
  predict(data_test) %>% 
  pull(.pred)

# Create tibble for distribution plot
rf_pred <- 
  tibble(observed = data_test$price, 
         predicted = rf_test_preds, 
         residual = observed - predicted)

# Plot distribution residuals
rf_residual_plot <- 
  rf_pred %>% 
  ggplot(aes(x = residual)) +
  geom_density(bw = 0.15, fill = "springgreen", alpha = 0.5) +
  geom_rug() +
  labs(title = "Random Forest Distribution Residuals") +
  theme(plot.title = element_text(hjust = 0.5)) +
  coord_cartesian(ylim = c(0, 1.5))
rf_residual_plot
```

```{r}
# Generate predicted values for price
knn_test_preds <- 
  knn_workflow_final %>% 
  fit(data = data_train) %>%
  predict(data_test) %>% 
  pull(.pred)

# Create tibble for distribution plot
knn_pred <- 
  tibble(observed = data_test$price, 
         predicted = knn_test_preds, 
         residual = observed - predicted)

# Plot distribution residuals
knn_residual_plot <- 
  knn_pred %>% 
  ggplot(aes(x = residual)) +
  geom_density(bw = 0.15, fill = "springgreen", alpha = 0.5) +
  geom_rug() +
  labs(title = "KNN Distribution Residuals") +
  theme(plot.title = element_text(hjust = 0.5)) +
  coord_cartesian(ylim = c(0, 1.5))
knn_residual_plot

```

## Assessment metrics

The table below shows results for the metrics for the three models. When assessing the metrics and their objectives, the results show that the random forest model performs best on all metrics. Therefore, the random forest is considered to be the best model. 
```{r}
# Generate table with 
lasso_metrics_compare <- 
  lasso_test_metrics %>% 
  select(-.estimator) %>% 
  mutate(model = "lasso regression")
rf_metrics_compare <- 
  final_res %>%
  collect_metrics() %>% 
  mutate(model = "random forest")
knn_metrics_compare <- 
  knn_last_fit %>% 
  collect_metrics %>% 
  select(-.estimator) %>% 
  mutate(model = "knn reg")
lasso_metrics_compare %>%
  bind_rows(rf_metrics_compare, knn_metrics_compare) %>% 
  select(-.estimator) %>% 
  pivot_wider(names_from = .metric, values_from = .estimate)
```

# Part 4: Recommendations

For this assignment we assessed three models for predicting prices on the Airbnb platform. These three models are a lasso regularized linear regression, a random forest and finally a K-Nearest Neighbours model. The models are assessed on three different metrics, namely the root mean squared error (RMSE), the mean squared error (MAE) and the r-squared. The proposed assessment metrics indicate that the random forest outperforms the lasso regularized linear regression and the K-Nearest-Neighbor in all proposed metrics.

Furthermore, the variable importance of the random forest indicates which variables lead to the highest increase in R-squared. Accommodates, bedrooms and room_type are the most important variables for predicting the price. Bed_type, pool, and wifi are the least important for predicting the price. We therefore advise hosts to specifically look at the variables with a high importance, as these variables have the highest impact on their predicted price.

As the model performance metrics were used to compare the models and to select the best one, these metrics should not be used to estimate the generalization error, or how well the model would perform on completely new data. For this purpose, it is recommended that the random forest model will be further tested in an online environment before scaling up its implementation. Additionally, in practice, the model will provide the prediction in log prices, and this has to be converted to euros. For example, a log price prediction of 4.09 would reflect price of €59.74.

How are these findings of use to Airbnb? Essentially, this project has sought to examine three approaches to competitively set listing prices of Airbnb listings. Considering our results, we advise Airbnb to use a random forest model in their projects that deal with optimal pricing of listings.  


